Population aging has been identified as a high-priority problem in many developed countries. It describes a shift in a country's population distribution towards the elderly, with a decline in the proportion of the younger population. This can overburden the welfare system, create labor shortages, reduce productivity and cause many other serious consequences that could ultimately lead to economic hardship.
Currently, the only feasible solution is to improve birth rates, which many developed countries, such as Korea and New Zealand, have heavily incentivized in order to combat population aging.
In this paper, we investigate Canada's birth rates by province, a key determinant of whether Canada is subjected to population aging. By analyzing the birth rate by province, we can further investigate whether it is associated with other social factors such as provincial GDP per capita, unemployment rate, crime rate, and education level. The paper then reaches a conclusion on the effect of each potential contributing factor by analyzing the correlation between that factor and the birth rate. From our results, policy makers can see which practices are more effective at increasing the birth rate and thus at combating population aging.
The following guiding questions aim to establish the existence of population aging in Canada and to visualize the severity of the low birth rate in each province. Then, we will analyze the correlation between each proposed social factor and the birth rate. By cross-analyzing the provinces, we aim to have enough data to determine whether these social factors are associated with the birth rate.
We only use the years 2000 to 2020 because of data set limitations, and because earlier data is less indicative of recent trends.
1. Is Canada subjected to population aging?
2. If Canada is subjected to population aging, what is the severity of the problem with low birth rate for each province?
3. What are the education levels in each Canadian province, and how will they affect the birth rate?
4. Will GDP and crime rate have an impact on the birth rate?
5. How much does unemployment affect the birth rate?
We will push forward our project in two parts, firstly... Secondly...
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import geopandas as gpd
import plotly as plotly
import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px
import csv
import warnings
warnings.filterwarnings("ignore") # to hide/ignore warnings
from urllib.request import urlopen
import json
1. Is Canada subjected to population aging?
In this question, we explore the trend of the elder population (age 65 and above) and the younger population (age 14 and below) from 2000 to 2020, and the change in each group's share of the population. By visualizing the trend and the proportion of the young and old populations, we can readily judge whether Canada is subjected to population aging.
Data Source: The World Bank, Statistics Canada
Data wrangling for Q1
(This is a general guideline, please refer to more detailed procedures from in-line comments)
Then
Finally
#importing dataset of the young population (ages 0-14)
#Data source: The World Bank. Accessible at: https://data.worldbank.org/indicator/SP.POP.0014.TO?end=2020&locations=CA&start=1960&view=chart
young_pop=pd.read_csv('pop_young.csv' , index_col=[0])
print('\n')
print('Before wrangling the data looks like this:')
display(young_pop.head())
#Filter out the interested year range: 2000-2019
cols=list(young_pop.columns)
young_pop= young_pop[cols[0:1]+cols[43:]]
#filter out for ONLY the canadian population
young_pop=young_pop.loc[young_pop['Country Code']== 'CAN']
young_pop.drop("Country Code", axis=1, inplace=True)
print('\n')
print('After wrangling the Data Looks like this')
print('\n')
display(young_pop.head(5))
#Importing Older population dataset, data from The World Bank
#Available at: https://data.worldbank.org/indicator/SP.POP.65UP.TO?end=2020&locations=CA&start=1960&view=chart
old_pop=pd.read_csv('pop_old.csv')
print('\n')
print('Before wrangling the data looks like this:')
display(old_pop.head(5))
#Filter out the interested year range: 2000-2019 for Canada only
colso=list(old_pop.columns)
old_pop= old_pop[colso[0:1]+colso[-21:]]
old_pop=old_pop.loc[old_pop['Country Name']== 'Canada']
print('\n')
print('After wrangling the Data Looks like this')
print('\n')
display(old_pop)
#Importing and cleaning the overall Canadian population, data from Statistics Canada
#Available at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1710000501&pickMembers%5B0%5D=1.1&pickMembers%5B1%5D=2.1&cubeTimeFrame.startYear=1995&cubeTimeFrame.endYear=2020&referencePeriods=19950101%2C20200101.
#This dataset also contains age groups, which could potentially be used to replace the two datasets above, but it only holds estimated populations,
#so the two datasets above provide more accuracy for the age-group populations.
all_pop=pd.read_csv("Canadian Population.csv",thousands=",")
print('\n')
print('Before wrangling the data looks like this:')
display(all_pop.head())
#Filter out All ages population which is the overall Canadian population at each year.
all_pop=all_pop.loc[all_pop['Age group 3 5']== 'All ages']
#change the stubborn comma-separated values to float.
all_pop['2000'] = all_pop['2000'].str.replace(',', '').astype(float)
#Change the default name to a readable name
all_pop.rename(columns={'Age group 3 5': 'All Canadian Population'},inplace=True)
print('\n')
print('After wrangling the Data Looks like this')
display(all_pop)
#Combining the 3 processed datasets together for future work.
print('\n')
print('Concatenating datasets without cleaning looks like this')
df=pd.concat([young_pop,old_pop,all_pop])
display(df.head())
#Cleaning the datasets, setting appropriate index, transform the dataset into more readable frame.
df.reset_index(inplace=True)
df= df.transpose()
df.reset_index(inplace= True)
df.columns=df.iloc[0]
#Renaming columns into more readable names
df=df.rename({'index':'Year', 'Canada': "Young population (15 and below)"},axis=1)
df=df.rename({35:'Old Population (65 and above)', 0: "Overall"},axis=1)
#skip the first row which was used to rename column names
df = df.iloc[1: , :]
df.dropna(inplace=True)
print('\n')
print('After cleaning the dataset looks like this')
display(df.head())
#Creating line chart to show the trend of the old and young population from year 2000 to 2020
fig1 = go.Figure()
fig1.add_trace(go.Scatter(
x=df['Year'],
y=df['Young population (15 and below)'],
name="Young population (15 and below)" ))
fig1.add_trace(go.Scatter(
x=df['Year'],
y=df['Old Population (65 and above)'],
name="Old Population (65 and above)" ))
fig1.update_layout(
title="Trend of Canadian Young and Old Population from 2000 to 2020",
xaxis_title="Year",
yaxis_title="Population",
font=dict(
size=14,
color="black"))
fig1.show()
At this point, we can see a clear upward trend for the elder population since 2000.
The young population, however, has stayed almost flat. Does that mean everything is fine? No: the problem becomes visible once we translate the counts into proportions of the overall population.
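To see why a flat headcount can still hide a shrinking share, here is a minimal sketch with made-up numbers. The figures and column names are assumptions for illustration only and are not taken from the World Bank or Statistics Canada data.

```python
import pandas as pd

# Hypothetical figures for illustration only (not the real data).
toy = pd.DataFrame({
    "Year": [2000, 2010, 2020],
    "young": [6.0, 6.0, 6.0],       # millions, flat over time
    "overall": [30.0, 34.0, 38.0],  # millions, growing over time
})

# Share of the young group in the overall population, in percent.
toy["young_share_pct"] = toy["young"] / toy["overall"] * 100
print(toy)
```

The young headcount never changes in this toy table, yet its share falls every decade because the denominator grows.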
#Calculate population proportions and add them to the dataframe
df['Overall']=pd.to_numeric(df['Overall']) #converting population from str-type to float
df['Young_proportion']=df['Young population (15 and below)'].div(df['Overall'].values)*100 #dividing population group by overall population, *100 to get percentage
df['Old_proportion']=df['Old Population (65 and above)'].div(df['Overall'].values)*100
display(df.head(5))
#Plotting the trend of proportion of the two age groups from 2000 to 2019
fig2 = go.Figure()
fig2.add_trace(go.Scatter(
x=df['Year'],
y=df['Young_proportion'],
name="Proportion of Young Population (%)" ))
fig2.add_trace(go.Scatter(
x=df['Year'],
y=df['Old_proportion'],
name="Proportion of Old Population (%)" ))
fig2.update_layout(
title="Trend of Canadian Young and Old Population Proportion from 2000 to 2020",
xaxis_title="Year",
yaxis_title="Proportion(%)",
font=dict(
size=14,
color="black"))
fig2.show()
Importance of analysis from Q1
Here, we can clearly see that the proportion of the elder population has been climbing, while the proportion of the young population has been going straight down.
Since 2015, the proportion of the elder population has exceeded that of the young population, and the gap between the two age groups is getting larger every year.
This is important because the chart clearly shows that the country is subjected to population aging. Without intervention, the serious consequences described in the introduction could very likely come true in Canada.
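The crossover year can also be read from the data rather than from the chart. The sketch below is one way to do it; the helper name `first_crossover_year` and the toy proportions are hypothetical, and rows are assumed to be sorted by year.

```python
import pandas as pd

def first_crossover_year(frame, young_col, old_col, year_col="Year"):
    """Return the first year in which old_col exceeds young_col, or None.

    Assumes the rows of `frame` are sorted by year in ascending order.
    """
    crossed = frame.loc[frame[old_col] > frame[young_col], year_col]
    return None if crossed.empty else crossed.iloc[0]

# Hypothetical proportions (percent) for illustration only.
toy = pd.DataFrame({
    "Year": [2013, 2014, 2015, 2016],
    "Young_proportion": [16.3, 16.1, 16.0, 15.9],
    "Old_proportion": [15.2, 15.7, 16.1, 16.5],
})
print(first_crossover_year(toy, "Young_proportion", "Old_proportion"))
```

Applied to the notebook's `df`, the same call with `'Young_proportion'` and `'Old_proportion'` would return the year reported above, provided both columns are numeric.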
2. What is the severity of low birth rate for each province?
From Q1 we can conclude that Canada is subjected to population aging, and we know from the introduction that the only feasible solution is to improve the birth rate so that a larger young population can fill labor demands. Therefore, in Q2, we illustrate the live birth rate of each province to visualize the severity of the low birth rate in each province.
Data Source: Github Open Source Dataset, Statistics Canada
Data wrangling for Q2
(This is a general guideline, please refer to more detailed procedures from in-line comments)
Then
Finally
#Importing Birth Rate dataset from Statistics Canada
#Available at https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310041801
br=pd.read_csv('Crude_Birth_rate.csv', index_col=[0])
print('\n')
print('raw data looks like this')
display(br.head())
br=br.dropna() #remove the empty space that was intended for visual appeal in Excel.
br.replace("..", np.nan, inplace=True) # remove unavailable data denoted as '..' by Statistics Canada
br.dropna(inplace=True) #Deleting the empty cells
br.reset_index(inplace=True)
br['Canada, place of residence of mother 12']=br['Canada, place of residence of mother 12'].apply(lambda x: x.split(',')[0]) #Getting the province name, without the trailing description.
br.rename(columns={'Canada, place of residence of mother 12':'Province'},inplace=True)
br=br.iloc[1:,:] # skip 1st row which was used to name column.
#tweak the dataframe for line graph only,dedicated for graph Figure 3.
br_line_data= br.transpose().reset_index()
br_line_data.columns=br_line_data.iloc[0]
br_line_data= br_line_data.iloc[1:, :]
br_line_data.rename(columns={'Province':'Year'},inplace=True)
print('\n')
print('processed data for line plot for figure 3')
display(br_line_data.head())
#Plotting the trend of live birth rate
fig3 = go.Figure()
# One trace per province or territory; the loop replaces the repeated add_trace calls
# and gives Prince Edward Island its own legend label.
for prov in ['Newfoundland and Labrador', 'Prince Edward Island', 'Nova Scotia', 'New Brunswick',
             'Quebec', 'Ontario', 'Manitoba', 'Saskatchewan', 'Alberta', 'British Columbia',
             'Northwest Territories', 'Nunavut']:
    fig3.add_trace(go.Scatter(
        x=br_line_data['Year'],
        y=br_line_data[prov],
        name="Live Birth Rate of " + prov))
fig3.update_layout(
title="Live Birth Rate of Each Canadian Province from 2000 to 2020",
xaxis_title="Year",
yaxis_title="Live Birth Rate (births per 1,000 population)",
font=dict(
size=14,
color="black"))
fig3.show()
From the graph, we can see a downward trend in the live birth rate in almost all provinces in Canada, especially since 2015. Experts may wish to investigate what happened around 2015 that could explain the country-wide decline in birth rate.
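The visual impression of a downward trend can be given a number by fitting a straight line to each province's series. The following is a minimal sketch using made-up rates; the helper name `linear_trend_slope` is hypothetical, and a linear fit is only one possible summary of the trend.

```python
import numpy as np

def linear_trend_slope(years, values):
    """Least-squares slope of values against years (units per year)."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(values, dtype=float)
    # Centering the years keeps the fit well conditioned; the slope is unchanged.
    slope, _intercept = np.polyfit(x - x.mean(), y, 1)
    return slope

years = [2015, 2016, 2017, 2018, 2019, 2020]
toy_rate = [10.9, 10.7, 10.4, 10.2, 9.9, 9.6]  # hypothetical birth rates

slope = linear_trend_slope(years, toy_rate)
print(f"{slope:.3f} per year")
```

A negative slope means the fitted line falls over the period; comparing slopes across provinces would show where the decline is steepest.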
#Creating a new dataset to illustrate the average yearly change in live birth rate since 2015 on a map.
#We chose 2015-2020 because 2015 is the year the elder population overtook the young population, and a clear decline in live birth rate is observed from then on.
#A recent year range also keeps our findings up to date.
br_for_plot=br[['Province','2015','2016','2017','2018','2019','2020']].copy() #copy so we do not modify a view of br
cols = br_for_plot.columns.drop('Province') #exclude the string-typed province column from the calculation.
br_for_plot[cols] = br_for_plot[cols].apply(pd.to_numeric)
br_diff=br_for_plot[cols].diff(axis=1) # year-over-year difference in live birth rate
br_diff.dropna(axis=1,inplace=True) # drop 2015, which has no previous year to compare with
br_diff['Average']=br_diff.mean(axis=1) # average yearly change in live birth rate from 2015 to 2020 (not the average rate itself)
br['Average BR Since 2015']=br_diff['Average'] # attach the average yearly change to the main dataframe.
#Loading Canadian geojson data for mapping.
with urlopen('https://raw.githubusercontent.com/codeforgermany/click_that_hood/main/public/data/canada.geojson') as response:
    provs = json.load(response)
#Create a dictionary to store provincial ID from geojson for mapping purpose.
prov_id_map={}
for feature in provs['features']: #use for-loop to avoid writing dozens of 'if' statements.
    feature['id']=feature['properties']['cartodb_id']
    prov_id_map[feature['properties']['name']]=feature['id']
br['id']=br['Province'].apply(lambda x: prov_id_map[x]) #creating a column 'id' to match the provincial id to each province from the main dataframe.
warnings.filterwarnings("ignore") #skip warning for version-compatibility.
print('\n')
print('After wrangling and processing, this is the dataframe that are to be graphed')
display(br.head())
#Using plotly.express.choropleth to map out the Canadian provinces with average live birth rate.
fig4=px.choropleth(br,locations='id',
    geojson=provs,
    color= 'Average BR Since 2015', #the average yearly change in live birth rate from 2015 to 2020, calculated above
    hover_name="Province",
    range_color=(-0.7, 0), #upper bound set to 0 so all provinces with a positive change are categorized into the same color
    color_continuous_scale=px.colors.sequential.Plasma)
fig4.update_geos(fitbounds='locations',visible=False)
fig4.show()
Important finding from the analysis in Q2
From the map above, we can see that all Canadian provinces have a negative average yearly change in birth rate since 2015. A negative change means the birth rate is, on average, lower each year than in the previous year.
Combined with the overall declining birth rate trend since 2000 (line graph, fig 3), this can impress upon provincial governments the severity of the low birth rate in their province, so that each of them can recognize the localized problem of population aging in its jurisdiction and be motivated to make changes.
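The quantity shown on the map is the mean of the year-over-year differences in birth rate. A small sketch with made-up rates shows how it is computed and that, because consecutive differences telescope, it depends only on the first and last year. The province names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical birth rates, illustration only.
rates = pd.DataFrame(
    {"2015": [10.8, 11.5], "2016": [10.6, 11.6], "2017": [10.3, 11.2],
     "2018": [10.1, 11.0], "2019": [9.9, 10.7], "2020": [9.5, 10.1]},
    index=["Province A", "Province B"],
)

# Year-over-year differences; the first column is all NaN and is dropped.
yearly_change = rates.diff(axis=1).dropna(axis=1)
avg_change = yearly_change.mean(axis=1)

# The differences telescope, so the average depends only on the endpoints.
endpoint_form = (rates["2020"] - rates["2015"]) / (rates.shape[1] - 1)
print(avg_change)
```

One consequence is that a single unusual year at either end of the window (for example 2020) can move the mapped value noticeably, which is worth keeping in mind when comparing provinces.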
3. What are the education levels in each Canadian province, and how will they affect the birth rate?
In this question, we use the education levels dataset from the Government of Canada to find out whether there is a relationship between education level and birth rate. First, we analyze data for Canada as a whole; then we use plotly.express to visualize each province's situation; finally, we calculate the correlation coefficient and give our conclusion.
Data Source: Educational attainment in the population aged 25 to 64,Government of Canada
Data process manually
Data wrangling through programs
Then
br=pd.read_csv('Crude_Birth_rate.csv', index_col=[0])
print('\n')
print('raw birth data ')
display(br )
provinceList = [ 'Canada','Newfoundland and Labrador','Prince Edward Island','Nova Scotia','New Brunswick',
'Quebec','Ontario','Manitoba','Saskatchewan','Alberta','British Columbia',
'Yukon','Northwest Territories','Nunavut']
#br_raw = pd.DataFrame("REF_DATE": "2000","GEO":provinceList "VALUE": )
#importing dataset of education levels
#Data source from Government of Canada https://open.canada.ca/data/en/dataset/c9c59a8f-ebe9-4444-a543-63261372c648
# step1 : read and wrangling Educational file
rdata_edu_r = pd.read_csv("./Education-raw.csv")
display("Education Raw Data:",rdata_edu_r.head(3),len(rdata_edu_r))
rdata_edu_r['Population characteristics'] = rdata_edu_r['Population characteristics'].str.strip()
rdata_edu_r2 = rdata_edu_r.loc[ (rdata_edu_r['REF_DATE'] > 2009 )
& (rdata_edu_r['Educational attainment level'] != 'Trades')
& (rdata_edu_r['Educational attainment level'] != 'Total, all levels')
& (rdata_edu_r['Population characteristics'] == 'Total population') ]
display("After wrangling Education Data:",rdata_edu_r2.head(3) ,len(rdata_edu_r2))
# step2 : compare canada's birth rate with education level
# define some common variables
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
colPar = ['c','g','b','y'] # line color
eduLevel = ['Less than high school','High school','College','University'] # education levels
rdata_edu = rdata_edu_r2
# plot canada's education
axnum = 0
for i,j in zip(colPar,eduLevel):
    rdata_edu2 = rdata_edu.loc[(rdata_edu['GEO'] == 'Canada') & (rdata_edu['Educational attainment level'] == j) &
                               (rdata_edu['REF_DATE'] > 2009 )& (rdata_edu['REF_DATE'] <2020)]
    xdata = rdata_edu2['REF_DATE']
    ydata = rdata_edu2['VALUE']
    if axnum < 2:
        ax1.plot(xdata,ydata,color=i, marker='+',label=j)
    else:
        ax2.plot(xdata,ydata,color=i, marker='+',label=j)
    axnum += 1
# plot canada's birthrate
rdata_br = pd.read_csv("./birth rate.csv")
display("Birth Raw Data:",rdata_br.head(5))
rdata_br2 = rdata_br.loc[(rdata_br['GEO'].str.contains('Canada')) &
(rdata_br['Characteristics'] == 'Total fertility rate per 1,000 females')&
(rdata_br['REF_DATE'] > 2009 )& (rdata_br['REF_DATE'] <2020)]
birthnum = rdata_br2['VALUE']/100 # adjust scale of Y-axis for birthrate
ax2.plot(xdata,birthnum,color='r', marker='*',label="Birth Rate")
ax2.legend(loc='best')
ax1.plot(xdata,birthnum,color='r', marker='*',label="Birth Rate")
ax1.legend(loc='best')
print('\n')
From the figure above on Canada's birth rate and education level, we can observe an inverse relationship between higher education and birth rate. In the panel on the right, the birth rate falls as the percentage of the population with higher education rises. The panel on the left confirms this: the birth rate dips as the share with lower education dips (that is, as more people attain higher education).
However, so far we have only used data for Canada as a whole. To verify whether this trend holds, we need to compare each province. To compute the correlation, we pick the "University" education level as the variable to set against birth rate.
# step3 : compare each province's birth data with higher education level
# reconstruct birthrate dataset, only keep date,geo and value,and add new column Name = 'Birth Level'
display("Raw Birth Dataset:",rdata_br.head(3))
rdata_br3 = rdata_br.loc[ (rdata_br['Characteristics'] == 'Total fertility rate per 1,000 females')
& (rdata_br['GEO'] != 'Northwest Territories including Nunavut') #this is not a standard province, discard it
& (rdata_br['REF_DATE'] > 2009 )& (rdata_br['REF_DATE'] <2020)] #get all provinces' birth rate dataset
rdata_br3['GEO'] = rdata_br3['GEO'].str.replace(str(', place of residence of mother'), '') #modify raw data GEO column
rdata_br4 = rdata_br3.loc[:,['REF_DATE','GEO','VALUE' ]]
rdata_br4['Name'] = 'Birth Level'
rdata_br4['VALUE'] = rdata_br4['VALUE']/100
rdata_br5 = rdata_br4.loc[ (rdata_br4['GEO'] != 'Northwest Territories including Nunavut')]
display("Constructed Birth Data:",rdata_br5.head(3))
# reconstruct education dataset, only keep date,geo and value ,and add new column Name = 'university'
rdata_edu3 = rdata_edu.loc[(rdata_edu['Educational attainment level'] == 'University')
& (rdata_edu['REF_DATE'] > 2009 )& (rdata_edu['REF_DATE'] <2020)]
rdata_edu4 = rdata_edu3.loc[:,['REF_DATE','GEO','VALUE' ]]
rdata_edu4['Name'] = 'University'
display("Constructed Education Data:",rdata_edu4.head(3))
# combine new dataset
rdata_combine = pd.concat([rdata_br5, rdata_edu4]) # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
display("Combined Dataset(Education and Birth) :",rdata_combine.head(3) )
# using plotly express to draw scatter graphs comparing education with birth rate, for each province over 10 years
fig = px.scatter(rdata_combine,x ="REF_DATE",y ="VALUE",animation_frame = "GEO" ,color = "Name",width=800, height=400,
    title = 'Compare Birth with Education, based on all provinces and 10 years scope' )
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 500 # control animation speed
fig.show()
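# step4: calculate correlation based on Canada as a whole
X = rdata_br2['VALUE'] # Canada-wide total fertility rate, 2010-2019
Y = rdata_edu2['VALUE'] # Canada-wide 'University' share, left over from the last iteration of the plotting loop in step2
result = np.corrcoef(X, Y)
print("The correlation value is {}, that is, higher education and birth rate in Canada are strongly negatively correlated\n".format(result[0,1]))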
Importance of findings from Guiding Question #3
Based on data from the past 10 years, all provinces show a similar pattern. The calculated correlation between birth rate and the higher education level ('University') for Canada as a whole is -0.98. So, we can conclude that higher education and birth rate are closely and negatively correlated; that is, higher education is strongly associated with a lower birth rate.
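The coefficient reported above is computed from the Canada-wide series. One way to check each province separately is sketched below. The data is made up, and the column names `birth` and `university` are assumptions for illustration; in the notebook the values would come from `rdata_br5` and `rdata_edu4`.

```python
import pandas as pd

# Hypothetical long-format data, illustration only.
birth = pd.DataFrame({
    "REF_DATE": [2010, 2011, 2012] * 2,
    "GEO": ["Ontario"] * 3 + ["Quebec"] * 3,
    "birth": [15.6, 15.4, 15.3, 17.0, 16.8, 16.7],
})
edu = pd.DataFrame({
    "REF_DATE": [2010, 2011, 2012] * 2,
    "GEO": ["Ontario"] * 3 + ["Quebec"] * 3,
    "university": [28.0, 29.1, 30.0, 24.0, 24.9, 25.5],
})

# Align the two series on province and year before correlating.
merged = birth.merge(edu, on=["REF_DATE", "GEO"])

# Pearson correlation within each province.
per_province = {
    geo: group["birth"].corr(group["university"])
    for geo, group in merged.groupby("GEO")
}
print(per_province)
```

Merging on both `REF_DATE` and `GEO` guards against pairing values from different years or provinces, which a plain positional comparison of two columns would not catch.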
4. Will GDP and crime rate have an impact on the birth rate?
In this question, we use two datasets from Statistics Canada, GDP and crime, to find out whether they could affect the birth rate. First, we use matplotlib.pyplot to visualize the comparison for each province; then we calculate the correlation coefficients for each province; finally, we use a chart to support our conclusion.
Data Source1: Gross domestic product (GDP) at basic prices,Statistics Canada
Data process manually
Data wrangling through programs
Data Source2: Crime Severity Index,Statistics Canada
Data process manually
Data wrangling through programs
#importing dataset of GDP
#Data source: Statistics Canada. Accessible at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3610040201
#importing dataset of crime
#Data source: Statistics Canada. Accessible at: https://www150.statcan.gc.ca/n1/pub/85-002-x/2021001/article/00013-eng.htm
# step1 : read GDP file
rdata_gdp_r = pd.read_csv("./GDP-raw.csv" ,header = 1,delimiter="\t",encoding = "ISO-8859-1" )
display("GDP Raw Data:",rdata_gdp_r.head(3)) # the original file header has lots of notes, so the preview below shows mostly NaN
col_name = ['Geography','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010',
'2011','2012','2013','2014','2015','2016','2017','2018','2019','2020']
rdata_gdp_r2 = rdata_gdp_r[10:23]
rdata_gdp_r2.columns = col_name # change column name
# change the year columns 2000-2019 from string to number
for col in col_name[1:21]:
    rdata_gdp_r2[col] = rdata_gdp_r2[col].str.replace(',', '').astype(float)
rdata_gdp = rdata_gdp_r2
display("After wrangling GDP Data:",rdata_gdp.head(3),len(rdata_gdp) )
#read Crime file
rdata_crime = pd.read_csv("./crime.csv" )
display("crime Raw Data:",rdata_crime.head(5))
# step2 : use matplotlib to compare GDP, crime and birth rate for each province over 10 years, in separate charts
def Plotbypro(ProName,Birthrate ):
    xdata = np.array([ 2010 , 2011 , 2012 , 2013 , 2014 , 2015 , 2016 , 2017 , 2018 , 2019 ])
    # 10 years GDP data for one province ProName
    rdata_gdp_p1 = rdata_gdp.loc[rdata_gdp['Geography'] == ProName]
    ydata_gdp = np.array([rdata_gdp_p1['2010'],rdata_gdp_p1['2011'],rdata_gdp_p1['2012'],rdata_gdp_p1['2013'],
                          rdata_gdp_p1['2014'],rdata_gdp_p1['2015'],rdata_gdp_p1['2016'],rdata_gdp_p1['2017'],
                          rdata_gdp_p1['2018'],rdata_gdp_p1['2019']])
    # 10 years crime data for one province ProName
    ydata_crime = np.array(rdata_crime[ProName][12:22]*300).tolist()
    # crime and birth rate are rescaled (*300, *3000) only so the three lines share one axis
    ax1.plot(xdata,ydata_gdp, color='g', marker='+',label='GDP')
    ax1.plot(xdata,ydata_crime,color='b', marker='+',label='Crime')
    ax1.plot(xdata,Birthrate*3000,color='r', marker='*',label="birth rate")
    plt.title(ProName)
    plt.xticks(rotation = 60)
    plt.legend(loc='best')
    #calculate correlation between crime,gdp with birth rate for each province
    ydata_gdp_l = np.transpose(ydata_gdp)
    corgdp = np.corrcoef(Birthrate, ydata_gdp_l)
    corcrime = np.corrcoef(Birthrate, ydata_crime)
    return (corgdp[0,1],corcrime[0,1])
# call function and use loop to draw all charts for 10 provinces
fig = plt.figure(figsize=(20,16))
j = 1
provinceList = [ 'Newfoundland and Labrador','Prince Edward Island','Nova Scotia',
'New Brunswick','Quebec','Ontario','Manitoba','Saskatchewan',
'Alberta','British Columbia']
# Yukon, Northwest Territories and Nunavut are discarded because they have less data
cor_gdp = []
cor_crime = []
for i in provinceList:
    ax1 = fig.add_subplot(3,4,0+j)
    br_onepro = rdata_br5.loc[rdata_br5['GEO']==i]
    br_onepro2 = np.array(br_onepro['VALUE'])
    gdp,crime = Plotbypro(i,br_onepro2)
    cor_gdp.append(gdp)
    cor_crime.append(crime)
    j += 1
Based on data from the past 10 years for all provinces, we can see a consistent pattern between GDP and birth rate, but we cannot observe any pattern between birth rate and crime rate.
To test this observation, we calculate and plot the correlation of crime and GDP with birth rate below.
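For reference, the value returned by np.corrcoef is the Pearson correlation coefficient. Writing $x_t$ for a province's birth rate in year $t$, $y_t$ for its GDP or crime index in the same year, and $\bar{x}$, $\bar{y}$ for their means over the period (symbols introduced here for illustration only), it is defined as:

```latex
r_{xy} = \frac{\sum_{t} (x_t - \bar{x})(y_t - \bar{y})}
              {\sqrt{\sum_{t} (x_t - \bar{x})^2}\,\sqrt{\sum_{t} (y_t - \bar{y})^2}}
```

The coefficient lies between -1 and 1 and is unaffected by rescaling either series, so the factors of 300 and 3000 used to align the lines in the charts do not change it.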
# step3 :plot correlation for each province
fig = plt.figure(figsize=(10,4))
ax1 = fig.add_subplot(1,2,1)
plt.bar(provinceList, cor_crime,width = 0.7,color='r')
plt.title("crime and birth correlation")
plt.xticks(rotation=90)
ax1 = fig.add_subplot(1,2,2)
plt.bar(provinceList, cor_gdp,width = 0.7)
plt.title("gdp and birth correlation")
plt.xticks(rotation=90)
plt.show()
Importance of findings from Q4
At this point, we can conclude from the cross-province correlation analysis that there is no consistent relationship between crime rate and birth rate, because we do not observe a uniform pattern across provinces.
However, we can see that birth rate and GDP are negatively correlated. This is especially so in Ontario, Nova Scotia and British Columbia, where the correlation is below -0.8, showing a strong negative correlation between GDP and birth rate.
So, we can conclude that GDP and birth rate are negatively correlated, and that there is no consistent relationship between crime and birth rate.
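The "uniform pattern" check can be made explicit by counting the sign of each province's coefficient and listing the strong ones. This is a minimal sketch: the helper `summarize_correlations`, the 0.8 threshold argument and the toy coefficients are all hypothetical and are not the values computed above.

```python
def summarize_correlations(cors, strong=0.8):
    """Count coefficients by sign and list the names with |r| >= strong."""
    negative = sum(1 for r in cors.values() if r < 0)
    positive = sum(1 for r in cors.values() if r > 0)
    strong_names = [name for name, r in cors.items() if abs(r) >= strong]
    return {"negative": negative, "positive": positive, "strong": strong_names}

# Hypothetical coefficients for illustration only (not the computed results).
toy = {"Province A": -0.85, "Province B": 0.40, "Province C": -0.10}
print(summarize_correlations(toy))
```

In the notebook, a call such as `summarize_correlations(dict(zip(provinceList, cor_gdp)))` would give the counts for GDP, and the same call with `cor_crime` for crime; a mix of positive and negative signs is what "no uniform pattern" means here.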
5. How much does unemployment affect the birth rate?
In this question, we use the unemployment dataset from Statistics Canada to find out whether there is a relationship between unemployment and birth rate. First, we use sklearn.linear_model to calculate the R-squared value for each province; then we visualize the result and describe our analysis.
R-squared (R2) is a statistical measure of the extent to which the variance of one variable (the independent variable) explains the variance of a second variable (the dependent variable). R-squared value means: $^{[12]}$
Data Source: Unemployment rate,Statistics Canada
Data process manually
Data wrangling through programs
Then
#importing dataset of unemployment
#Data source: Statistics Canada. Accessible at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410032702
# step1 : read file
with open(r'./unemployment-raw.csv','r') as f: # the file is closed automatically after reading
    rdata_un_r = pd.DataFrame(list(csv.reader(f)),dtype=str)
display("unemployment Raw Data:",rdata_un_r.head(5)) # original file has lots of non-data rows
rdata_un_r2 = rdata_un_r.loc[12:45,0:22] # slice column and row
col_name = ['Geography 3','Labour force characteristics',
'2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010',
'2011','2012','2013','2014','2015','2016','2017','2018','2019','2020']
rdata_un_r2.columns = col_name # rename sliced dataset
rdata_un = rdata_un_r2
rdata_un2 = rdata_un.loc[rdata_un['Labour force characteristics'] == 'Unemployment rate 4']
display("After Wrangling unemployment Data:",rdata_un.head(5) )
#get main provinces' birth rate dataset , from 2000 to 2019
rdata_br = pd.read_csv("./birth rate.csv")
rdata_br['GEO'] = rdata_br['GEO'].str.replace(str(', place of residence of mother'), '')
rdata_br6 = rdata_br.loc[ (rdata_br['Characteristics'] == 'Total fertility rate per 1,000 females')]
rdata_br7 = rdata_br6.loc[ (rdata_br6['GEO'] != 'Northwest Territories including Nunavut')
& (rdata_br6['GEO'] != 'Nunavut') & (rdata_br6['GEO'] != 'Northwest Territories')
& (rdata_br6['GEO'] != 'Yukon') ] # the unemployment dataset does not include these territories
display("Modified birth rate Data :",rdata_br7.head(3) )
# step2 : define a function that uses sklearn to calculate the linear regression R-squared
def getRSQ(ProName) :
    #get birth rate for one province
    x_br = np.array(rdata_br6.loc[rdata_br6['GEO'] == ProName ]['VALUE']/100).tolist()
    #get unemployment data for one province
    rdata_un3 = rdata_un2.loc[rdata_un2['Geography 3'] == ProName ]
    y_un = []
    ListYear = np.arange(2000,2020 ) # years 2000 to 2019
    for i in ListYear:
        # the unemployment file is a horizontal (wide) table: read the value for each year
        y_un.append(float(rdata_un3[str(i)].iloc[0]))
    #calculate linear regression r square
    x_br2 = np.array(x_br).reshape((-1, 1))
    model = LinearRegression().fit(x_br2, y_un)
    r_sq = model.score(x_br2, y_un)
    return r_sq
#step 3: use a loop to calculate the r-square value for each province
provinceList = [ 'Newfoundland and Labrador','Prince Edward Island','Nova Scotia',
                 'New Brunswick','Quebec','Ontario','Manitoba','Saskatchewan',
                 'Alberta','British Columbia']
sq_un=[]
for i in provinceList :
    r_sq_un = getRSQ(i)
    sq_un.append(r_sq_un)
# step4 : visualize r square between unemployment and birth rate, for each province over 20 years
# (getRSQ regresses unemployment on birth rate; with a single predictor the r square is the same in either direction)
fig = plt.figure(figsize=(9,5))
plt.bar( provinceList, sq_un, width=0.9,color=['r', 'g', 'b','y'])
plt.xticks(rotation = 90)
plt.ylim((0,1))
plt.ylabel("R-squared value")
index = np.arange(len(sq_un))
for a,b in zip(index,sq_un):
    plt.text(a, b+0.05, '%.2f'%b, ha='center', va= 'bottom',fontsize=7)
plt.title("R-square (unemployment and birthrate)")
plt.show()
Importance of finding from Q5
From the chart above, we can see that the R-squared values are all less than 0.3; that is, only very little of the variance in birth rate can be explained by the variance in unemployment. We can conclude that the unemployment rate has a very weak association with birth rate. Thus, it should be a low priority with regard to improving the birth rate.
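For a straight-line fit with a single predictor, R-squared equals the square of the Pearson correlation, so it does not matter which of the two variables is treated as the independent one. The sketch below checks this with NumPy alone, using made-up numbers; the helper `r_squared` is a hypothetical stand-in for the sklearn-based calculation, not the notebook's own function.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of the least-squares line predicting y from x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.dot(residuals) / ((y - y.mean()) ** 2).sum()

# Hypothetical values for illustration only.
birth = [15.9, 15.7, 15.4, 15.6, 15.1, 14.9]
unemployment = [7.1, 6.4, 7.5, 6.9, 6.6, 7.2]

r = np.corrcoef(birth, unemployment)[0, 1]
print(r_squared(birth, unemployment), r_squared(unemployment, birth), r ** 2)
```

All three printed values agree up to floating-point error, which is why an R-squared below 0.3 corresponds to a correlation weaker than about 0.55 in absolute value.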
In this project, we successfully visualized and analyzed the topic of aging in Canada, especially the underlying social factors that contribute to the low birth rate as the root problem.
In summary, our conclusions are as follows:
From the analysis of our proposed social factors contributing to a low birth rate, we conclude:
Based on our findings, we hope to inspire government officials to start implementing changes to combat population aging and the low birth rate. Furthermore, future studies can prioritize their focus based on our results, for example by elaborating on the specific effects of GDP per capita and higher-education level.